CTA-aware Prefetching for GPGPU

Authors

  • Hyeran Jeon
  • Gunjae Koo
  • Murali Annavaram
Abstract

Several memory prefetching schemes have been proposed to reduce the performance impact of long-latency memory operations in GPUs. Leveraging the simple intuition that consecutive warps are likely to have spatial locality, prior approaches prefetch two or four consecutive cache lines when there is a cache miss. Other approaches predict striding accesses by detecting a base address and stride value from each warp's load address history. Warp-based load prediction works well when a load instruction is repeatedly executed in a loop. But to exploit parallelism, GPU kernels favor creating a massive number of threads rather than sequential loop code. In this paper, we exploit the observation that all the threads generated from a kernel execute the same code segment. When threads are grouped into cooperative thread arrays (CTAs), each thread uses its thread id and CTA id to identify the data that it operates on. Thus the stride values for loads among warps within a cooperative thread array (CTA) can be easily detected, and this stride value can be applied across all the CTAs in the kernel. However, the starting base address accessed by the first warp in a CTA is difficult to predict, since that starting address depends on how the CTAs are created by the application programmer. Hence, we propose to first compute the base address of a load in each CTA by using a leading warp. The leading warp of each CTA is executed early by pairing it with warps from the currently executing CTA. Thus our proposed CTA-aware prefetch predicts all the trailing warps' load addresses by first computing the base address early through a leading warp and then applying the stride value. Through simple enhancements to the existing two-level scheduler, prefetches can be issued sufficiently ahead of the demand requests. CTA-aware prefetch predicts addresses with over 94% accuracy and is able to improve performance by 5.4%.
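The abstract's core mechanism — learn the inter-warp stride for a load PC from one CTA's warps, then combine it with a leading warp's base address to predict the trailing warps' addresses in another CTA — can be sketched in software. This is an illustrative model only, not the paper's hardware design; all class, method, and parameter names here are hypothetical.

```python
# Illustrative model of CTA-aware prefetch address prediction.
# Assumption (not from the paper's text): the stride between consecutive
# warps' load addresses at a given PC is constant within and across CTAs,
# while each CTA's base address must be observed from its leading warp.

class CTAAwarePrefetcher:
    def __init__(self):
        self.history = {}   # (cta_id, pc) -> list of (warp_idx, addr) observed
        self.stride = {}    # pc -> learned inter-warp stride in bytes

    def observe(self, cta_id, warp_idx, pc, addr):
        """Record a demand load and learn the inter-warp stride for this PC."""
        key = (cta_id, pc)
        hist = self.history.setdefault(key, [])
        if hist:
            prev_idx, prev_addr = hist[-1]
            if warp_idx != prev_idx:
                # Normalize by warp-index distance in case warps issue out of order.
                self.stride[pc] = (addr - prev_addr) // (warp_idx - prev_idx)
        hist.append((warp_idx, addr))

    def predict(self, cta_id, pc, base_warp_idx, base_addr, n_warps):
        """Given the leading warp's base address for a new CTA, predict the
        trailing warps' load addresses using the learned stride."""
        s = self.stride.get(pc)
        if s is None:
            return []  # no stride learned yet for this PC
        return [base_addr + (w - base_warp_idx) * s
                for w in range(base_warp_idx + 1, n_warps)]


# Example: CTA 0's four warps load 128 bytes apart at PC 0x40; once CTA 1's
# leading warp (warp 0) issues its load at 0x5000, the trailing warps'
# addresses follow from the learned stride.
pf = CTAAwarePrefetcher()
for w in range(4):
    pf.observe(cta_id=0, warp_idx=w, pc=0x40, addr=0x1000 + 128 * w)
preds = pf.predict(cta_id=1, pc=0x40, base_warp_idx=0, base_addr=0x5000, n_warps=4)
print([hex(a) for a in preds])  # ['0x5080', '0x5100', '0x5180']
```

The split between `observe` and `predict` mirrors the abstract's division of labor: stride detection is amortized across CTAs, while the hard-to-predict per-CTA base address is obtained by running the leading warp early.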


Related Papers

Context-Aware Prefetching at the Storage Server

In many of today’s applications, access to storage constitutes the major cost of processing a user request. Data prefetching has been used to alleviate the storage access latency. Under current prefetching techniques, the storage system prefetches a batch of blocks upon detecting an access pattern. However, the high level of concurrency in today’s applications typically leads to interleaved blo...


Global-aware and multi-order context-based prefetching for high-performance processors

Data prefetching is widely used in high-end computing systems to accelerate data accesses and to bridge the increasing performance gap between processor and memory. Context-based prefetching has become a primary focus of study in recent years due to its general applicability. However, current context-based prefetchers only adopt the context analysis of a single order, which suffers from low pre...


Energy-Aware Data Prefetching for General-Purpose Programs

There has been intensive research on data prefetching focusing on performance improvement, however, the energy aspect of prefetching is relatively unknown. Our experiments show that although software prefetching tends to be more energy efficient, hardware prefetching outperforms software prefetching on most of the applications in terms of performance. This paper proposes several techniques to m...


Algorithms to Take Advantage of Hardware Prefetching

Cache-oblivious and cache-aware algorithms have been developed to minimize cache misses. Some of the newest processors have hardware prefetching where cache misses are avoided by predicting ahead of time what memory will be needed in the future and bringing that memory into the cache before it is used. It is shown that hardware prefetching permits the standard Floyd-Warshall algorithm for all-p...


Performance of multiuser network-aware prefetching in heterogeneous wireless systems

We study the performance of multiuser document prefetching in a two-tier heterogeneous wireless system. Mobility-aware prefetching was previously introduced to enhance the experience of a mobile user roaming between heterogeneous wireless access networks. However, an undesirable effect of multiple prefetching users is the potential for system instability due to the racing behavior between the d...



Journal:

Volume:   Issue:

Pages:  -

Publication date: 2014